A Machine Learning Method for Allocating Scarce COVID-19 Monoclonal Antibodies

Key Points
Question. Can a policy learning–based allocation method improve the population health benefits achieved when allocating scarce treatments?
Findings. This retrospective cohort study examined methods for allocating scarce neutralizing monoclonal antibodies, using electronic health record data from more than 15 000 patients with COVID-19 within a large health care system, and found that a policy learning tree–based allocation method would potentially have resulted in lower hospitalization rates than the observed allocation. Further, a point system based on the policy learning trees outperformed another commonly used point-scoring system.
Meaning. These findings, based on electronic health record data, suggest that machine learning methods, namely policy learning trees, can improve the allocation of scarce therapeutics; therefore, policy learning tree–based allocation should be considered in potential future episodes of therapeutic scarcity, including pandemics.

testing cohort. Relative to the summer of 2021, October to December brought more hospitalizations and patient visits, and Colorado implemented crisis standards of care during this period because of health care strain.1 Thus, the sample for 10/1-12/11 was large (n=9,542), making this period suitable as a training cohort. In addition, 41% of the training cohort was treated (3,912 treated patients), so the PLTs had a large number of treated patients with which to model the relationship between treatment assignment and patient covariates. We chose the June to October period as the testing cohort because it experienced the most severe mAb shortage, which enabled us to evaluate the proposed method under resource scarcity (21% were treated, compared with 41% in the training cohort). We also conducted a sensitivity analysis of our model using a random splitting approach; the results are in eAppendix 2.

B. Policy Learning
Suppose the patient outcome of interest is $Y_i$, $W_i \in \{0, 1\}$ is the treatment status, and $X_i$ is the vector of patient covariates used for allocation, for each patient $i \in \{1, 2, \ldots, n\}$.
Policy Learning (PL) models the relationship between observed covariates and the treatment allocation and finds an allocation policy $\hat{\pi}$ among a class of policies $\Pi$ (e.g., decision trees of finite depth) that maps patient covariates $X_i$ ($X_i$ may be a subset of

The primary purpose of our paper is to introduce the PLTs framework for allocating a general treatment; the PLTs model we discussed assumes that all mAbs share the same outcome, and we combined all mAbs into a general treatment arm.

Strengths.
Compared to allocation models that typically model outcome risk factors without treatment information, PL models effect modification by covariates, which indicates which covariates an allocation should prioritize and use to achieve the best overall outcome (hospitalizations).2,3 PL also has theoretical guarantees for finding a policy with bounded regret, i.e., a bounded difference in policy value between the PL-learned policy and the optimal policy in a pre-specified set of allocation rules. It also applies to settings with unmeasured confounding and instrumental variables.2 The pre-specified set of allocation rules in PL can be a class of constrained policies, such as the depth-k decision trees that we focus on in this paper, or policies with outcome constraints or covariate constraints. Lastly, the PL model can be generalized to multi-arm settings.6,7

Limitations.
Since PL is a new method, software implementations of PL are currently limited to the class of decision tree policies in the 'policytree' R package. This package requires complete covariate data and excludes individuals with missing values in the covariates used for allocation; this handling of missing data is similar to that of many prediction models. If the data are missing at random, one may use multiple imputation by chained equations to impute the missing data.8 In addition, it is possible to construct a set of allocation rules with certain constraints manually, but the implementation is not built into existing R packages or other software platforms. Another limitation of PL with a decision tree is that the estimate can vary substantially with small changes in the data, which motivated us to develop a Policy Learning Tree ensemble to address this problem.

C. Policy Learning Trees
To reduce the variability of policy learning results, we built an ensemble of 130 policy learning trees (PLTs), in which each policy is represented as a decision tree. The PLT-based allocation was obtained by majority voting and was assessed by the expected overall hospitalization reduction from treatment in the testing cohort. To determine the best-performing policy tree structure that maximizes the overall treatment effect under allocation, we evaluated each policy tree in its corresponding out-of-bag (OOB) samples,9 where the performance metric is the policy value function.

$\hat{\Gamma}_i = \hat{\mu}(X_i, 1) - \hat{\mu}(X_i, 0) + \dots$, where all functions were estimated from the bagging samples (non-OOB samples) and $\hat{\mu}(X_i, \cdot)$ is the estimate of the counterfactual response under treated (1) or untreated (0).

iv. Policy Learning Tree (Stage II): Pre-specify the set of allocation rules $\Pi$, then find $\hat{\pi}$ by Equation S1. We pre-specified the policy tree class as decision trees with a depth of 4, an exact tree search depth of 3, and a minimum leaf node size of 50 samples. Then, we used a policy tree search algorithm that searches for a decision tree policy $\hat{\pi}$ that maximizes the policy value function in Equation S1.

Steps i-iii are Stage I and step iv is Stage II in eFigure 1, where steps i-ii and steps iii-iv used different data. Steps i-ii apply to the re-sampled training data (62%-64% of the training data) and step iii applies to the out-of-bag samples (36%-38% of the training data). In causal forests, we used leave-one-out cross-validation to tune the fraction of samples, the number of variables tried at each split, the minimum number of observations in each tree leaf, and the maximum imbalance of a split in the three random forest models in Stage I, and all forest models used 4000 trees. We implemented all Stage I models through the 'grf' R package.11 During policy learning, we calculated doubly robust treatment effect estimates from causal forest outputs.12 We then used the treatment effect estimates as our optimization target to obtain the best allocation policy over depth-4 decision trees through the 'policytree' R package (eFigure 1; Stage II).

Positivity assumption, i.e., the treatment propensity is bounded away from 0 and 1; 4) Patient data are independent and identically distributed. The previous subsection discussed detailed assumptions behind Stage II.
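As a rough illustration of the doubly robust scoring step, the sketch below computes the standard augmented inverse probability weighting (AIPW) score for a single patient. It assumes the usual form of this score in the policy learning literature. The function name `aipw_score`, its argument names, and all numeric inputs are hypothetical, and the analysis described here was run with the 'grf' and 'policytree' R packages rather than with this Python code.

```python
def aipw_score(y, w, mu0_hat, mu1_hat, e_hat):
    """Standard AIPW (doubly robust) score for one patient.

    y        observed binary outcome (1 = hospitalized)
    w        observed treatment (1 = treated, 0 = untreated)
    mu0_hat  estimated outcome risk if untreated
    mu1_hat  estimated outcome risk if treated
    e_hat    estimated treatment propensity
    All inputs are hypothetical; in practice they would come from
    models fitted on data that exclude this patient.
    """
    if not 0.0 < e_hat < 1.0:
        raise ValueError("propensity must lie strictly between 0 and 1")
    mu_w = mu1_hat if w == 1 else mu0_hat          # fitted risk under the observed arm
    plug_in = mu1_hat - mu0_hat                    # model-based risk difference
    correction = (w - e_hat) / (e_hat * (1.0 - e_hat)) * (y - mu_w)
    return plug_in + correction


# Hypothetical patients: one treated and hospitalized, one untreated and not.
score_treated = aipw_score(y=1, w=1, mu0_hat=0.05, mu1_hat=0.02, e_hat=0.4)
score_untreated = aipw_score(y=0, w=0, mu0_hat=0.05, mu1_hat=0.02, e_hat=0.4)
print(score_treated, score_untreated)
```

Individual scores are noisy and can fall far outside the range of a risk difference; they are meant to be averaged or passed to a policy search, not read one at a time.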
Strengths of causal forests. The framework behind Steps i-iii applies to continuous, binary, count, or any other outcome type that has a finite second moment.5 It can estimate CATE in data with missing covariates. It has theoretical support as a consistent estimator of CATE through the doubly robust estimator.3,5 The framework is flexible in estimating treatment propensities and marginal outcome models, as long as the estimates converge sufficiently fast.4,5 Thus, it allows us to model interactions and nonlinear effects of patient covariates on the observed treatment allocation and the outcome. Here, we focus on the causal forest, which is a powerful tool for estimating conditional average treatment effects in biomedical applications.13,14 The implementation of causal forests is available in the 'grf' R package.

Limitations. A PLT may require a large sample size to achieve the optimal policy value.
The estimation of CATE from causal forests also requires a large sample size, especially in the setting of a sparse binary outcome. Tree ensemble methods can also be time-consuming for continuous covariates. The computational time increases with the sample size, the number of covariates, the number of levels in the covariates, and tree parameters such as the tree depth. Running each tree in a PLT ensemble took about 1.2 minutes. Despite the computational time, we recommend using a validation dataset (e.g., a set of OOB samples) to grid search the best parameters for the PLT ensemble. Causal forests may not be able to uncover high-dimensional covariate interactions and may violate the positivity assumption with high-dimensional covariates.
More extensive studies of the limitations of causal forests are in previous papers.16,17

Covariate Importance Calculation
We calculated variable importance as the weighted sum of the frequencies with which a covariate was split on at each depth among all trees in the PLTs ensemble. Higher weights were given to parent nodes than to child nodes. Specifically, we gave a weight of 1/depth^2 to the node or nodes at each depth of the tree, following the variable importance calculation in the generalized random forest.11
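The depth-weighted importance described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the root node has depth 1, represents each tree as a hypothetical list of (depth, covariate) split records, and returns unnormalized totals. The covariate names are invented for the example.

```python
from collections import defaultdict


def variable_importance(trees):
    """Sum a weight of 1/depth^2 for every split on each covariate.

    trees: list of trees, each a list of (depth, covariate) tuples for the
    split nodes of that tree. Depth 1 is assumed to be the root (an
    assumption of this sketch).
    """
    importance = defaultdict(float)
    for splits in trees:
        for depth, covariate in splits:
            importance[covariate] += 1.0 / depth ** 2
    return dict(importance)


# Two hypothetical depth-2 trees from an ensemble.
example_trees = [
    [(1, "age"), (2, "vaccinated"), (2, "age")],
    [(1, "vaccinated"), (2, "cardiovascular")],
]
importance = variable_importance(example_trees)
ranking = sorted(importance, key=importance.get, reverse=True)
print(importance, ranking)
```

A root split counts four times as much as a split at depth 2, which is how parent nodes receive more weight than child nodes.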

F. Policy Learning Trees Point System Details
We developed data-driven point systems based on PLT-based regression models.
Specifically, we implemented a forward model selection from a baseline model that used the top five important covariates found by the PLTs, and we set the upper bound for the model selection to be a complex regression model suggested by the PLTs ensemble (such as the interaction between fully vaccinated status and cardiovascular comorbidity). The complex regression model regressed the estimated conditional causal risk differences on variables within the treatment allocation subgroups identified in the policy tree. Subgroups are defined by variable combinations or interactions that enhance the treatment effect in the observed data and lead to treatment allocation in the tree node (see "Policy Learning" section). Those variables and the four-, three-, and two-way interactions suggested by the policy tree form the upper complex model for the forward model selection from the baseline model with the top five important covariates in the PLTs. The model selection uses a p-value threshold of 0.05 to add new variables.18 Following Sullivan et al., we converted the coefficients for variables, including variable interactions, in the final regression models into point systems that predict the causal mAb treatment effect used to guide mAb allocation.19 The points given to each variable and interaction term were determined by dividing the regression coefficient of each variable by the coefficient with the smallest absolute value and then rounding to the nearest integer.20,21

Assumptions. The point system framework of Sullivan et al. depends on a linear regression model between the covariates and the quantity used for allocation. In our case, we used the doubly robust conditional treatment effect estimate as the allocation principle and modeled how the covariates modify this estimate.
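The coefficient-to-points conversion described above can be sketched in a few lines. The coefficient names and values below are hypothetical and do not come from the fitted models; the sketch only illustrates dividing by the smallest absolute coefficient and rounding.

```python
def coefficients_to_points(coefficients):
    """Divide each coefficient by the smallest absolute coefficient and round.

    coefficients: dict mapping a variable or interaction name to its
    regression coefficient. Note that Python's round() sends exact halves
    to the nearest even integer, which may differ from other rounding rules.
    """
    nonzero = [abs(b) for b in coefficients.values() if b != 0]
    if not nonzero:
        raise ValueError("at least one nonzero coefficient is required")
    base = min(nonzero)
    return {name: round(b / base) for name, b in coefficients.items()}


# Hypothetical coefficients from a final regression model.
example_coefficients = {
    "age_65_or_older": 0.045,
    "not_fully_vaccinated": 0.031,
    "cardiovascular_disease": 0.012,
    "vaccinated_x_cardiovascular": -0.019,
}
points = coefficients_to_points(example_coefficients)
print(points)
```

In this example the smallest coefficient (0.012) maps to 1 point, and the negative interaction term subtracts points when both conditions are present.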
Performance Metric. We compared point systems across various hypothetical treatment proportions to no treatment rather than to the observed allocation, because the observed allocation treated only a specific proportion of the population.

G. PLT-based Allocation for Evolving COVID-19 Conditions
© 2024 Xiao M et al. JAMA Health Forum.
The PLTs allocation model focuses on which patient covariates modify the effect of mAb rather than on the treatment effect itself, so it may be less sensitive to evolving variants and a potential lack of mAb utility than a treatment effect analysis. For example, sotrovimab, which lost its effectiveness during the Omicron period, still had a larger and more significant treatment effect among patients older than 65 years than among those younger than 65 years.22 This direction of effect modification was consistent with the direction of change in mAb treatment effectiveness by age in previous reports.23

H. Individual Patient Benefit Graph
We recommend building visualizations of PLT results and infographics that may help promote equitable access to care, especially in patient outreach messaging and in community partnerships that educate about treatment benefits. To visualize PLT results, we built an individual patient benefit (hospitalization risk reduction) graph, for which we calculated the causal mean risk reduction from treatment (the negative of the causal risk difference) among patients who share the same allocation point in the testing cohort. We excluded an allocation point if fewer than 10 patients shared it, to reduce the sensitivity of our results to random noise. Then, we fit a linear regression of the mean risk reduction on the PLT-based points. We present these steps as a simple illustration of converting PLTs into a visualization; refining these steps is still needed in future research.
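The steps above can be sketched as follows, assuming an unweighted least-squares fit on the group means (the text does not say whether the regression was weighted). The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def mean_benefit_by_points(points, risk_differences, min_group_size=10):
    """Mean risk reduction per point value, dropping small groups.

    Risk reduction is the negative of the causal risk difference.
    """
    groups = defaultdict(list)
    for p, rd in zip(points, risk_differences):
        groups[p].append(-rd)
    return {p: sum(v) / len(v) for p, v in groups.items()
            if len(v) >= min_group_size}


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical testing data: three well-populated point values and one
# sparse value (12 points, only 3 patients) that is excluded.
toy_points = [0] * 10 + [5] * 10 + [10] * 10 + [12] * 3
toy_risk_differences = [-0.01] * 10 + [-0.03] * 10 + [-0.05] * 10 + [-0.90] * 3
means = mean_benefit_by_points(toy_points, toy_risk_differences)
xs = sorted(means)
ys = [means[p] for p in xs]
slope, intercept = fit_line(xs, ys)
print(means, slope, intercept)
```

Dropping the sparse group keeps its extreme mean from dominating the fitted line, which is the stated purpose of the minimum group size of 10.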

A. Data Splitting Sensitivity Analysis Results
We conducted a sensitivity analysis by randomly assigning 60% of the entire data from 6/2021 to 12/2021 (n=15,790) to a training cohort (n=9,474) and the remaining 40% to a testing cohort (n=6,316). The overall expected hospitalization was reduced by 1.4% (95% CI: -2.4%, -0.5%) compared to the observed allocation in the testing cohort after random splitting (the main result showed a reduction of 1.6% with a slightly narrower 95% CI: -2.0%, -1.2%). We also observed that the top important variables in the PLTs were similar to our main finding (eFigure 5A). Thus, the PLT-based regression model would start from the same base model in the forward model selection process, although the best PLT and its covariate interaction information were different, as

this value to get an NNT of 13 and calculated the proportion of patients who had a benefit greater than or equal to this benefit; in this case, 6% of all patients for whom we calculated PLT-based points had a benefit from treatment greater than or equal to that of this patient. We conveyed the meaning of 6% as "You are among the top 6% to benefit from mAb treatment", and we explained what the NNT meant with "Reducing 1 hospitalization after treating 13 people like you", accompanied by graphics. Since causal risk differences come from doubly robust conditional treatment effects, the interpretation applies to patients who share similar covariates, so we added "people like you" to our NNT interpretation.
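The NNT and "top percentage" calculations described above can be sketched as follows. The risk reduction of 0.077 and the cohort values are hypothetical numbers chosen to reproduce an NNT of 13 and a share of 6%; the source does not report the underlying risk reduction. Rounding the NNT up is a common convention and an assumption of this sketch.

```python
import math


def number_needed_to_treat(risk_reduction):
    """NNT as the inverse of the absolute risk reduction, rounded up."""
    if risk_reduction <= 0:
        raise ValueError("NNT is only defined here for a positive risk reduction")
    return math.ceil(1.0 / risk_reduction)


def share_with_at_least(benefit, all_benefits):
    """Proportion of patients whose benefit is greater than or equal to `benefit`."""
    return sum(b >= benefit for b in all_benefits) / len(all_benefits)


# Hypothetical patient and cohort (50 patients, including this patient).
patient_benefit = 0.077
cohort_benefits = [0.08] * 2 + [patient_benefit] + [0.02] * 47
nnt = number_needed_to_treat(patient_benefit)
top_share = share_with_at_least(patient_benefit, cohort_benefits)
print(f"NNT = {nnt}; among the top {top_share:.0%} to benefit")
```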

B. Model Quality Checks - Sample Size
The PLT framework here is a nonparametric method that requires a large sample size (our training and testing samples were both larger than 6000), and it may perform poorly when the training sample size is small. When the training sample size is small, we suggest exploring the use of parametric methods, such as penalized regression, until the sample size reaches a sufficient number determined by simulations.24,25 For example, through a preliminary simulation, we found a sample size of 250 to be sufficiently large for the PLT to reach optimal performance (eFigure 8; simulation details below). Users and policymakers can apply this simulation to determine the acceptable amount of difference from the optimal policy value and risk reduction (under the best and true policy in the testing data). For example, a training sample size of 250 had near-optimal results by visual assessment (eFigure 8), and our training sample size (>5,000) was substantially larger than 250, leading to a positive conclusion about our PLT model quality.
Simulation details. We ran a preliminary simulation to show that simulations can help assess feasible sample sizes for a reliable PLT. We designed our simulation parameters to match the raw outcome risks we observed in the training cohort, i.e., 3

eAppendix 1. Methods Details
We first described our choice of training and testing data. Then, we outlined details of Policy Learning and Policy Learning Trees, including their assumptions, strengths, and limitations. We further elaborated on our innovations, including details behind the Policy Learning Trees ensemble (under Policy Learning Trees), details behind the point systems and their basis in the ensemble's covariate importance, and model check details. We also explained the methods we used to create the individual patient benefit graph.

A. Training and Testing Data Splitting
We divided our data from 6/2021 to 12/2021, the Delta variant phase of COVID-19, into 10/1/2021-12/11/2021 (training) and 6/1/2021-9/30/2021 (testing) according to clinical knowledge about these two time periods and how well they suit use as the training vs.

) to treatment allocations in $\{0, 1\}$ and that maximizes overall outcomes through a policy value function:2

$$\hat{\pi} = \arg\max\left\{ \frac{1}{n}\sum_{i=1}^{n}\left(2\pi(X_i) - 1\right)\hat{\Gamma}_i : \pi \in \Pi \right\}. \quad \text{(S1)}$$

$\hat{\Gamma}_i$ is the estimated doubly robust conditional average treatment effect (DR-CATE), which we refer to as the causal risk difference for patient $i$, and $\pi(X_i) \in \{0, 1\}$ is the allocation based on the policy $\pi$ using patient covariate values, where 0 refers to no treatment and 1 refers to treatment. The PL authors2,3 noted that maximizing the policy value function is equivalent to optimizing the overall outcome, and they showed that the maximizing policy $\hat{\pi}$ has bounded regret that decays with the sample size $n$. The regret is defined as the difference between the expected outcome under $\hat{\pi}$ and the best outcome that could be achieved among the policies in the set of allocation rules $\Pi$. The policy value function leverages the estimated CATE to find the optimal policy $\hat{\pi}$.

Assumptions. 1) The conditional treatment effect $\Gamma_i$ is a linear function of the average individual counterfactual response. This can be achieved through the doubly robust estimator in Equation S2 (Policy Learning Trees Implementation Details below) if the estimates of the treatment propensity and outcome model converge sufficiently fast (at the rate of $1/\sqrt[4]{n}$, achieved by most machine learning estimators);4,5 2) the estimates of the counterfactual response and DR-CATE have finite variance and converge to the true values in probability at every possible combination of patient covariates and treatment status; 3) the set of allocation rules $\Pi$ has a finite dimension (an intuitive analogy is a finite number of parameters needed to specify each policy in $\Pi$); 4) the coefficients in the linear function used to calculate $\Gamma_i$ are bounded for every combination of patient covariates and treatment status; and Assumptions 1, 2, 3, and 4 behind Stage I below.
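To make the policy value function concrete, the sketch below evaluates the objective in Equation S1 for two candidate allocations. It is a toy illustration, not the tree search itself: the scores are hypothetical, and they are assumed to be oriented so that larger values favor treatment.

```python
def policy_value(allocations, scores):
    """Empirical policy value: the mean of (2 * pi(X_i) - 1) * Gamma_i.

    allocations: list of 0/1 treatment decisions pi(X_i)
    scores:      list of doubly robust scores Gamma_i (hypothetical here,
                 assumed oriented so that larger values favor treatment)
    """
    if len(allocations) != len(scores) or not scores:
        raise ValueError("allocations and scores must be non-empty and equal in length")
    total = sum((2 * a - 1) * g for a, g in zip(allocations, scores))
    return total / len(scores)


# Four hypothetical patients.
example_scores = [0.10, -0.05, 0.20, -0.02]
treat_everyone = [1, 1, 1, 1]
treat_if_positive = [1 if g > 0 else 0 for g in example_scores]

value_everyone = policy_value(treat_everyone, example_scores)
value_targeted = policy_value(treat_if_positive, example_scores)
print(value_everyone, value_targeted)
```

The targeted rule scores higher because withholding treatment from patients with negative scores flips their contribution to positive. In practice the search is restricted to the policy class $\Pi$, so this unconstrained rule is only an upper benchmark for what a tree can achieve.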

ii. Causal forest model (Stage I): Define $\tilde{Y}_i = Y_i - \hat{E}^{(-i)}(Y_i \mid X_i)$ and $\tilde{W}_i = W_i - \hat{e}^{(-i)}(X_i)$, where $\hat{E}^{(-i)}(Y_i \mid X_i)$ and $\hat{e}^{(-i)}(X_i)$ come from trees in random forests for which patient $i$ is in the out-of-bag sample set. Then, obtain the CATE $\hat{\tau}$ by minimizing over $\tau$ the loss function $\frac{1}{n}\sum_{i=1}^{n}\left(\tilde{Y}_i - \tilde{W}_i\,\tau(X_i)\right)^2 + \Lambda_n(\tau)$, where $\Lambda_n(\tau)$ is the regularization term for the complexity of the $\tau(\cdot)$ function. This step is done through the causal forest. Each tree in the causal forest used about 50% of the training data, and child nodes were developed by maximizing treatment effect differences between nodes. The tree used sub-samples (50% of the data used in building a tree) of the samples for treatment effect estimation to determine whether to split into child nodes. It uses an adaptive weight $\alpha_i(x)$ in a Robinson estimator to estimate the CATE $\tau(x)$,3,10 where $\alpha_i(x)$ captures how frequently training sample $i$ falls into the same leaf as a patient with covariates $x$.

iii. Doubly robust estimator (Stage I): Use a doubly robust estimator of the conditional treatment effect (DR-CATE), $\hat{\Gamma}_i$, by augmented inverse probability weighting among OOB samples:2

expected. Finally, by developing the PLT-based point system from the training data after the random split, we showed that the points trained under the two data splitting approaches had a similar distribution among the 2,511 patients who overlapped between the random testing cohort and the testing cohort from the clinical split (eFigure 6).

B. Interpretation of Individual Patient Benefit Graph
eFigure 7 shows that an increasing hospitalization risk reduction relative to no treatment (causal risk difference) is associated with higher PLT-based points. Color gradients in the grid correspond to different levels of number needed to treat (NNT). When incorporating this graph into patient- and provider-facing materials through community engagement and user interface design, we suggest explaining the NNT and the implication of a patient's points using the data we used to build the graph. For example, if a patient had a point score of 12, we found the corresponding risk reduction (benefit). We then inverted

eAppendix 3. Practical Consideration of Real-Time PLT-Based Allocation During Resource Scarcity

A. Real-Time Data Quality Checks
Our team has established a real-world data platform that allowed us to produce reliable evidence in real time from bi-weekly electronic health record data deliveries. Patient-level data were linked with statewide vaccination records from the Colorado Comprehensive Immunization Information System and with mortality information from Colorado Vital Records. These data underwent rigorous data quality checks, including comparative analysis with previous data deliveries, temporal evaluation of COVID-positive dates in relation to mAb treatment and hospitalization events (with imputation where missing), monitoring of the distribution of mAb treatment types over time and pandemic phase, and comparison of the observed data with statewide epidemiological data.
Assumptions behind Stage I: 1) Stable Unit Treatment Value Assumption. The potential outcomes (e.g., expected hospitalization) of an individual depend only on the individual's received treatment, i.e., $Y_i = Y_i(W_i)$; 2) Treatment is independent of potential outcomes after conditioning on $X_i$, that is, there is no unmeasured confounding; 3)

Our innovation of the PLTs ensemble over the original PLT in Athey and Wager2 allows us to analyze important patient covariates and construct a point system after the PL procedure, and it uses re-sampling and OOB samples to reduce variability and overfitting compared to a single PLT.15

Strengths of PLTs.
27% in treated group and 4.8% in untreated group. Data come from the model $Y_i = Y_i(0) + W_i\,\tau(X_i)$, where we used the following distributions to simulate the model components: … $55 + \varepsilon_i$)), each covariate in $X_i$ is binary and was simulated from $\mathrm{Bernoulli}(0.7)$, the treatment indicator $W_i \sim \mathrm{Bernoulli}\left(1/\left(1 + e^{-(X_{i4} + X_{i5} + \cdots + X_{i9} - 4.2 + \varepsilon_i)}\right)\right)$, the CATE $\tau(X_i) = -(\ldots + X_{i2} + X_{i3} + X_{i4}) + X_{i1}X_{i3} - 0.128$, and the random error term $\varepsilon_i \sim N(0, 1)$. At each training sample size (N = 50, 100, 150, 200, 250, 300), we ran 200 simulations to assess whether the PLT can produce the optimal value function in Equation S1 and the optimal hospitalization risk reduction from the observed outcome risk (the performance metric in the main paper). The sample sizes for the testing data in our evaluation were the same as for the training data. We used the default parameters of the causal forest function in the 'grf' package and depth-4 policy trees. The performance metrics are the policy value function in Equation S1 and the risk reduction from the observed data that we used in the main paper. We followed some of the simulation parameter choices from Athey and Wager,2 and we acknowledge that users should refine our simulation set-ups further.

C. Model Quality Checks - Robustness to Sensitivity Analysis
It is important to show that conclusions from the model are robust to data variations. We demonstrated the PLT model's robustness to adding vs. excluding the race variable in the ...

The testing cohort comprised patients seen between June 1, 2021, and September 30, 2021.
b. The "Other" category included patients who reported their race as American Indian or Alaska Native, Native Hawaiian and Other Pacific Islander, Asian Indian, Chinese, Filipino, Japanese, Korean, Multiple Race, and Other and who reported their ethnicity as Non-Hispanic or unknown ethnicity.

Allocation Thresholds and Number Needed to Treat Among Allocated

the difference between the risk of hospitalization that we would have observed if the population had been treated and the risk of hospitalization that we would have observed if the population were not treated.
PLT: Policy Learning Trees; NNT: Number needed to treat.